Disentangled Motif-aware Graph Learning for Phrase Grounding

Abstract

In this paper, we propose a novel graph learning framework for phrase grounding in images. Evolving from sequential to dense graph models, existing works capture coarse-grained context but fail to distinguish the diversity of context among phrases and image regions. In contrast, we pay special attention to the different motifs implied in the context of the scene graph and devise a disentangled graph network to integrate motif-aware contextual information into representations. Besides, we adopt interventional strategies at the feature and structure levels to consolidate and generalize the representations. Finally, a cross-modal attention network is utilized to fuse intra-modal features, where the similarity between each phrase and the regions is computed to select the best-grounded one. We validate the efficiency of the disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on the Flickr30K Entities and ReferIt Game benchmarks.
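The last step the abstract mentions, scoring a phrase against candidate regions and keeping the best match, can be illustrated with a minimal sketch. This is not the authors' DIGN model: the cosine similarity measure, the toy three-dimensional vectors, and the names `cosine_similarity` and `ground_phrase` are all assumptions made for illustration, standing in for whatever learned features and scoring function the paper actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def ground_phrase(phrase_vec, region_vecs):
    """Return (index of the highest-scoring region, all scores).

    Hypothetical helper: expects a non-empty list of region vectors.
    """
    scores = [cosine_similarity(phrase_vec, r) for r in region_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy features, made up for this example only.
phrase = [1.0, 0.0, 1.0]
regions = [
    [0.0, 1.0, 0.0],  # orthogonal to the phrase
    [2.0, 0.0, 2.0],  # same direction as the phrase
    [1.0, 1.0, 0.0],  # partial overlap
]
best_index, scores = ground_phrase(phrase, regions)
print(best_index, scores)
```

In a real system the phrase and region vectors would come from the fused cross-modal features rather than hand-written lists; only the argmax-over-similarity selection is shown here.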

Similar resources

Scalable Motif-aware Graph Clustering

We develop new methods based on graph motifs for graph clustering, allowing more efficient detection of communities within networks. We focus on triangles within graphs, but our techniques extend to other clique motifs as well. Our intuition, which has been suggested but not formalized similarly in previous works, is that triangles are a better signature of community than edges. We therefore ge...

Linear Disentangled Representation Learning for Facial Actions

Limited annotated data available for the recognition of facial expression and action units embarrasses the training of deep networks, which can learn disentangled invariant features. However, a linear model with just several parameters normally is not demanding in terms of training data. In this paper, we propose an elegant linear model to untangle confounding factors in challenging realistic m...

Knowledge Aided Consistency for Weakly Supervised Phrase Grounding

Given a natural language query, a phrase grounding system aims to localize mentioned objects in an image. In weakly supervised scenario, mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicte...

DNA-GAN: Learning Disentangled Represen-

Disentangling factors of variation has always been a challenging problem in representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, bad quality of generated images from encodings, lack of identity information, etc. In this paper, we proposed a supervised algorithm called DNA-GAN trying to disentangle different attributes of images....

Unsupervised Learning of Disentangled Representations from Video

We present a new model DRNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the ti...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i15.17602